Index-Based R-S Similarity Joins
Authors
Abstract
Similarity Joins are among the most useful and powerful data processing operations. They retrieve all pairs of data points, drawn from different data sets, whose distance falls within a given threshold. This operation is useful in many situations, such as record linkage and data cleaning. An important way to implement Similarity Joins efficiently is to use indexing structures. Previous work, however, supports only self-joins or requires jointly indexing every pair of relations that participate in a Similarity Join. We present an algorithm that extends a previously proposed index-based algorithm (eD-Index) to support Similarity Joins over two relations. Our approach operates over individual indices. We evaluate the performance of this algorithm, contrast it with an alternative approach, and investigate the configuration of parameters that maximizes performance. Our results show that our algorithm significantly outperforms the alternative in terms of distance computations, and reveal interesting properties when comparing execution time.
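To make the operation concrete, the following is a minimal sketch of an R-S similarity join under a distance threshold. It is a naive nested-loop baseline, not the index-based algorithm described in the abstract; the function name `naive_rs_similarity_join`, the use of Euclidean distance, and the sample points are assumptions chosen for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def naive_rs_similarity_join(r, s, eps):
    """Return every pair (x, y), x from r and y from s, with distance <= eps.

    Hypothetical baseline: it performs len(r) * len(s) distance computations,
    which is the cost that index-based methods aim to reduce.
    """
    pairs = []
    for x in r:
        for y in s:
            if dist(x, y) <= eps:
                pairs.append((x, y))
    return pairs


# Illustrative data (assumed, not from the paper).
R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.0, 1.0), (5.0, 5.5), (9.0, 9.0)]
result = naive_rs_similarity_join(R, S, eps=1.0)
print(result)
```

With these sample points, two of the six candidate pairs fall within the threshold. The quadratic number of distance computations in this baseline is the quantity the abstract uses to compare approaches.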
Similar resources
A Fast Algorithm for high-dimensional Similarity Joins
Many emerging data mining applications require a similarity join between points in a high-dimensional domain. We present a new algorithm that utilizes a new index structure, called the ε-kdB tree, for fast spatial similarity joins on high-dimensional points. This index structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of nd...
Embedding Similarity Joins into Native XML Databases
Similarity joins in databases can be used for several important tasks such as data cleaning and instance-based data integration. In this paper, we explore how to support such tasks in a native XML database environment. The main goals of our work are: a) to prove the feasibility of performing tree similarity joins in a general-purpose XML database management system; b) to support string- and ...
Similarity Join in Metric Spaces Using eD-Index
Similarity join in distance spaces constrained by the metric postulates is the necessary complement of the better-known similarity range and nearest neighbor search primitives. However, the quadratic computational complexity of similarity joins prevents their application to large data collections. We present the eD-Index, an extension of D-index, and we study an application of the eD-Index to imp...
Plug&Join: An easy-to-use Generic Algorithm for Efficiently Processing Equi and Non-Equi Joins
This paper presents Plug&Join, a new generic algorithm for efficiently processing a broad class of different types of joins in an extensible database system. Plug&Join is not only designed to support equi joins, temporal joins, spatial joins, subset joins and other types of joins, but in contrast to previous algorithms it can be easily customized and it allows efficient processing of new types ...
High-Dimensional Similarity Joins
Many emerging data mining applications require a similarity join between points in a high-dimensional domain. We present a new algorithm that utilizes a new index structure, called the ε-kdB tree, for fast spatial similarity joins on high-dimensional points. This index structure reduces the number of neighboring leaf nodes that are considered for the join test, as well as the traversal cost of f...